Conversation

@ziw-liu (Contributor) commented Jul 3, 2025

Needs czbiohub-sf/iohub#311

To be investigated: multiprocessing-based parallelism is not compatible with the asyncio-based thread parallelism that zarr-python is designed around, and it appears to be somewhat slower.
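A minimal sketch of why thread parallelism fits zarr-python's model: thread workers share the interpreter's memory (and zarr's single asyncio event loop), so they can write disjoint chunks into one destination in place, while process workers each spawn their own event loop and must serialize data across process boundaries. This is illustrative stdlib-only code, not biahub's actual implementation; the buffer stands in for a zarr array.

```python
from concurrent.futures import ThreadPoolExecutor

def write_chunk(dest: bytearray, offset: int, payload: bytes) -> None:
    # Threads can write into a shared buffer in place; disjoint chunk
    # slices mean no lock is needed for this access pattern.
    dest[offset:offset + len(payload)] = payload

CHUNK = 4
buf = bytearray(16)  # stand-in for a chunked output array
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(4):
        pool.submit(write_chunk, buf, i * CHUNK, bytes([i]) * CHUNK)
# exiting the context manager waits for all chunk writes to finish
```

With a `ProcessPoolExecutor`, the same pattern would silently fail: each worker would mutate its own pickled copy of `buf`, which is one concrete reason process-based parallelism composes poorly with an in-process async store.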

@ziw-liu (Contributor, Author) commented Jul 3, 2025

As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

@ziw-liu (Contributor, Author) commented Jul 10, 2025

Ran into zarr-developers/zarr-python#3221.

@ziw-liu (Contributor, Author) commented Jul 11, 2025

> As of 83cd243 converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. As a reference converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

As of 606a0c4, the same conversion now takes about 2 minutes.

pyproject.toml (outdated diff context):

    "scikit-learn",
    ]

    [project.optional-dependencies]
@ziw-liu (Contributor, Author): Is this any different than the one from PyPI?

Collaborator: Good question - I'm guessing no? @tayllatheodoro may know better.

Review diff:

    -sbatch_filepath: str = None,
    +sbatch_filepath: str | None = None,
     local: bool = False,
     block: bool = False,
Collaborator: What was the motivation for including the block parameter? Was it useful during testing?

@ziw-liu (Contributor, Author): When running locally there is no good way to check whether the jobs (processes) have finished; the block option is also useful for testing.
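A hedged sketch of what a `block=True` option might do for local runs; `run_local_jobs` and the command list below are illustrative names, not biahub's actual API. The point is simply that without blocking, the caller has no handle on when locally spawned processes finish.

```python
import subprocess
import sys

def run_local_jobs(
    commands: list[list[str]], block: bool = False
) -> list[subprocess.Popen]:
    # Launch each job as a local subprocess.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    if block:
        # Wait for every process to exit before returning, so the
        # caller (or a test) can rely on the jobs being done.
        for p in procs:
            p.wait()
    return procs

procs = run_local_jobs(
    [[sys.executable, "-c", "print('job done')"]],
    block=True,
)
```

On a SLURM cluster the same role is played by the scheduler (job state can be queried), which is why blocking matters mainly for the local code path and for tests.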

@mattersoflight mattersoflight added this to the Data Infrastructure milestone Aug 14, 2025
@edyoshikun edyoshikun requested a review from srivarra August 18, 2025 16:18
@ieivanov (Collaborator) commented Sep 4, 2025

The current plan is to merge this PR after czbiohub-sf/iohub#301, updating the iohub dependency to the main branch.

@srivarra (Collaborator) left a comment:

LGTM. I was able to run biahub concatenate -c rechunk.yml -o test.zarr -sb sbatch.sh on a dataset that had not yet been converted from OME-NGFF v0.4 / Zarr v2 to OME-NGFF v0.5 / Zarr v3, located at /hpc/projects/intracellular_dashboard/organelle_dynamics/rerun/2025_04_15_A549_H2B_CAAX_ZIKV_DENV/2-assemble/zarr-v3.

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Blocked until we bump waveorder:

    ERROR: Cannot install None, biahub and biahub[dev]==0.1.0 because these package versions have conflicting dependencies.
    The conflict is caused by:
        biahub 0.1.0 depends on iohub<0.4 and >=0.3.0a2
        biahub[dev] 0.1.0 depends on iohub<0.4 and >=0.3.0a2
        waveorder 3.0.0a1 depends on iohub<0.3 and >=0.2
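The resolver output above shows why no version can satisfy both constraints: biahub requires iohub>=0.3.0a2 while the pinned waveorder pre-release caps it at iohub<0.3, so the intersection is empty. Unblocking requires a waveorder release that itself allows iohub>=0.3; the fragment below is illustrative only, with hypothetical version numbers that are not from this PR.

```toml
# pyproject.toml (sketch; version numbers are hypothetical)
[project]
dependencies = [
    "iohub>=0.3.0a2,<0.4",
    # requires a future waveorder release whose own constraint
    # permits iohub>=0.3, otherwise the conflict above recurs:
    "waveorder>=3.0.0a2",
]
```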

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Another blocker is the dependency chain napari-psf-analysis -> bfio -> zarr<3.

4 participants